Skip to content

feat: add CLI shell and exec commands for pod terminal access#124

Open
V2arK wants to merge 39 commits intomainfrom
honglin/shell-forwarding-clean
Open

feat: add CLI shell and exec commands for pod terminal access#124
V2arK wants to merge 39 commits intomainfrom
honglin/shell-forwarding-clean

Conversation

@V2arK
Copy link
Copy Markdown
Contributor

@V2arK V2arK commented Mar 12, 2026

Summary

  • Add centml cluster shell <id> for interactive terminal sessions (like docker exec -it)
  • Add centml cluster exec <id> -- <command> for running commands and returning output (like ssh host "cmd")
  • Both connect via WebSocket through the Platform API terminal proxy, matching the same protocol used by the Web UI TerminalView
  • Add get_status_v3() to SDK client for pod discovery
  • Add websockets>=13.0 dependency

Summary by CodeRabbit

  • New Features

    • Added shell and exec cluster subcommands for interactive and non-interactive pod terminal access.
    • Added get_status_v3() API method for deployment status retrieval.
  • Chores

    • Added runtime dependencies: websockets and pyte.
  • Tests

    • Added comprehensive tests for CLI shell commands, SDK shell session/renderer, and test configuration behavior.

@V2arK V2arK marked this pull request as draft March 12, 2026 18:12
@V2arK V2arK self-assigned this Mar 12, 2026
@V2arK V2arK force-pushed the honglin/shell-forwarding-clean branch 2 times, most recently from 0073b53 to 095bd1e Compare March 12, 2026 18:24
@V2arK V2arK requested a review from michaelshin March 13, 2026 18:50
@V2arK V2arK marked this pull request as ready for review March 15, 2026 16:47
@michaelshin michaelshin requested a review from anandj91 March 17, 2026 17:53
@michaelshin michaelshin requested a review from anandj91 March 17, 2026 20:23
@anandj91
Copy link
Copy Markdown
Contributor

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@click.option("--pod", default=None, help="Specify a pod name")
@click.option("--shell", "shell_type", default=None, type=click.Choice(["bash", "sh", "zsh"]), help="Shell type")
@click.option(
"--first-pod", is_flag=True, default=False, help="Auto-select the first running pod (skip interactive selection)"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this right? if --pod is not provided, then we default to first pod

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are multiple replicas, then we interactively ask the user for a pod. --first-pod ensures that we apply the command to the first pod

@anandj91
Copy link
Copy Markdown
Contributor

@coderabbitai full review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 25, 2026

✅ Actions performed

Full review triggered.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 25, 2026

Walkthrough

Adds WebSocket-based interactive and non-interactive shell commands to the CLI, SDK session utilities and exceptions, a v3 status API method, new runtime deps, and comprehensive tests for pod resolution, WS sessions, and terminal rendering.

Changes

Cohort / File(s) Summary
CLI Shell Commands
centml/cli/main.py, centml/cli/shell.py
Register shell and exec subcommands on the ccluster Click group. Implement pod resolution (explicit, first, interactive), token retrieval, WS URL construction, and invocation of interactive or exec async sessions.
SDK Shell Core
centml/sdk/shell/__init__.py, centml/sdk/shell/exceptions.py, centml/sdk/shell/session.py
Introduce shell package: exception hierarchy (ShellError, NoPodAvailableError, PodNotFoundError), WS URL builder, running-pod enumeration, interactive and exec session implementations with marker-based output extraction and pyte-based ANSI handling, IO forwarding, terminal sizing, and signal handling.
SDK API Extensions
centml/sdk/api.py
Add CentMLClient.get_status_v3(deployment_id) forwarding to the V3 status endpoint.
Dependencies
requirements.txt
Add websockets>=13.0 and pyte>=0.8.0.
Test Configuration
tests/conftest.py
Add pytest pytest_configure hook and PyTorch test file allowlist to skip selected files during --sanity runs.
CLI Shell Tests
tests/test_cli_shell.py
New tests for CLI shell and exec commands covering TTY checks, pod resolution (explicit/first/interactive/single/none), option forwarding, and command passthrough via --.
SDK Shell Tests
tests/test_sdk_shell_session.py, tests/test_sdk_shell_renderer.py
New tests for WS URL construction, exec session message sequencing and marker-based parsing/exit-code extraction, forward_io termination behavior, interactive session terminal restoration, and ANSI-stripping renderer behavior.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as "CLI Shell"
    participant API as "CentML API"
    participant WS as "WebSocket / Pod"

    User->>CLI: Invoke shell/exec command
    CLI->>API: Query deployment status / running pods
    API-->>CLI: Pod list
    alt Single pod or --first-pod
        CLI->>CLI: Auto-select pod
    else Multiple pods
        CLI->>User: Prompt for pod
        User-->>CLI: Select pod
    end
    CLI->>API: Request auth token
    API-->>CLI: Token
    CLI->>CLI: Build WS URL
    CLI->>WS: Connect with token
    WS-->>CLI: Connection established
    alt Interactive
        CLI->>WS: Resize PTY
        rect rgba(100, 150, 200, 0.5)
            loop Bidirectional I/O
                User->>CLI: stdin
                CLI->>WS: send stdin frames
                WS->>CLI: stdout/stderr frames
                CLI->>User: display output
            end
        end
    else Exec (non-interactive)
        CLI->>WS: Resize + send wrapped command (with markers)
        rect rgba(100, 150, 200, 0.5)
            WS->>CLI: Output including BEGIN/END markers
            CLI->>CLI: Parse output, extract exit code
        end
        CLI->>User: show output and exit code
    end
    CLI->>WS: Close connection
    WS-->>CLI: Closed
    CLI->>User: Return exit code
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I nibble at code, whiskers all a-quiver,
WebSockets hum as pods deliver,
Markers blink, terminals sing,
Async hops make output spring,
A tiny rabbit cheers: shells now shiver!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main additions: CLI shell and exec commands for pod terminal access, which aligns with the core changes introducing interactive and non-interactive terminal functionality.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch honglin/shell-forwarding-clean
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch honglin/shell-forwarding-clean

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
tests/test_sdk_shell_session.py (1)

315-323: Consider explicitly managing the background shutdown task.

The fire-and-forget task pattern here works, but it's implicit. While Python 3.10+ guarantees the task won't be collected, explicitly retaining a reference and cleaning it up in a finally block (or using a context manager) would make the intent clearer and reduce the risk of subtle issues if the code evolves.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_sdk_shell_session.py` around lines 315 - 323, The test starts a
fire-and-forget task via asyncio.create_task(_set_shutdown()) without keeping a
reference; change the _run coroutine to store the task (e.g., shutdown_task =
asyncio.create_task(_set_shutdown())), then wrap the call to forward_io(ws, [80,
24], shutdown) in a try/finally where in finally you cancel the shutdown_task
(if still running) and await it to ensure cleanup; reference symbols to edit:
_run, _set_shutdown, shutdown, forward_io, and the asyncio.create_task call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@centml/cli/shell.py`:
- Around line 41-56: The _connect_args function can block by calling _select_pod
(which uses click.prompt) when multiple pods exist; modify _connect_args to
avoid prompting when stdin is not a TTY: import sys and before calling
_select_pod check sys.stdin.isatty(), and if it's False and neither pod nor
first_pod is provided raise click.ClickException instructing the caller to pass
--pod or --first-pod; keep the existing behavior (call _select_pod) only when
sys.stdin.isatty() is True. Ensure this logic is applied inside the
_connect_args function surrounding the branch that currently sets pod_name =
_select_pod(running_pods, deployment_id).
- Around line 89-91: The exec_cmd function currently concatenates the
Click-provided command tuple with " ".join(command), which loses
quoting/escaping; replace that with shlex.join(command) and ensure shlex is
imported so exec_cmd calls asyncio.run(exec_session(ws_url, token,
shlex.join(command))) to preserve argument boundaries for POSIX shells (refer to
exec_cmd and exec_session symbols).

In `@centml/sdk/shell/session.py`:
- Around line 226-229: The code currently initializes exit_code = 0 and silently
falls through on ConnectionClosed, causing a false "success" when END_MARKER
wasn't received; update the ConnectionClosed handling in
centml/sdk/shell/session.py to treat a socket close before detecting END_MARKER
as a transport failure by either setting exit_code to a non-zero value (e.g., 1)
or raising an exception (e.g., RuntimeError("socket closed before END_MARKER")),
and ensure this logic checks the buffer/is_done/END_MARKER state so you only
mark failure when the trailer was not parsed; apply the same change to the
second occurrence referenced (around lines 258-260) so both code paths return
non-zero or raise on missing END_MARKER.
- Around line 11-12: The code calls websockets.connect(...) with
additional_headers (in centml/sdk/shell/session.py at the two call sites that
currently use websockets.connect around lines 162 and 208) which is only
supported by the new asyncio client; change the import to use the asyncio client
explicitly by importing connect from websockets.asyncio.client and replace
websockets.connect(...) calls with connect(...), or alternatively raise the
minimum requirement to websockets>=14.0 in requirements so the existing import
"import websockets" remains valid—pick one approach and apply it consistently to
both call sites.

In `@tests/test_cli_shell.py`:
- Around line 43-68: The test helper _patch_deps currently only patches
asyncio.run, causing real coroutine objects from interactive_session and
exec_session to be created (leading to un-awaited coroutine warnings and
preventing assertions); update _patch_deps to also patch
centml.cli.shell.interactive_session and centml.cli.shell.exec_session (or have
the patched asyncio.run explicitly await/close a fake coroutine) so tests can
inspect calls and avoid warnings; ensure the patched names (interactive_session,
exec_session) return a simple mock/coroutine that records the URL, shell type,
and command used by shell() and exec_cmd() for assertions.

---

Nitpick comments:
In `@tests/test_sdk_shell_session.py`:
- Around line 315-323: The test starts a fire-and-forget task via
asyncio.create_task(_set_shutdown()) without keeping a reference; change the
_run coroutine to store the task (e.g., shutdown_task =
asyncio.create_task(_set_shutdown())), then wrap the call to forward_io(ws, [80,
24], shutdown) in a try/finally where in finally you cancel the shutdown_task
(if still running) and await it to ensure cleanup; reference symbols to edit:
_run, _set_shutdown, shutdown, forward_io, and the asyncio.create_task call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1295c58d-f50f-4796-812b-7dac84fb274d

📥 Commits

Reviewing files that changed from the base of the PR and between 470a8b6 and 3ff11ee.

📒 Files selected for processing (11)
  • centml/cli/main.py
  • centml/cli/shell.py
  • centml/sdk/api.py
  • centml/sdk/shell/__init__.py
  • centml/sdk/shell/exceptions.py
  • centml/sdk/shell/session.py
  • requirements.txt
  • tests/conftest.py
  • tests/test_cli_shell.py
  • tests/test_sdk_shell_renderer.py
  • tests/test_sdk_shell_session.py

Comment on lines +226 to +229
exit_code = 0
buffer = ""
is_capturing = False
is_done = False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't report success when the socket closes before END_MARKER.

exit_code starts at 0, and the ConnectionClosed path just falls through. If the pod or proxy disconnects before the trailer is parsed, centml cluster exec returns success even though the command outcome is unknown. Please treat “no end marker received” as a transport failure and return non-zero, or raise instead.

Suggested fix
-        exit_code = 0
+        exit_code = None
@@
-        except websockets.ConnectionClosed:
-            pass
-        return exit_code
+        except websockets.ConnectionClosed:
+            if not is_done:
+                sys.stderr.write("Error: terminal session closed before command completion\n")
+                return 1
+        return exit_code if is_done and exit_code is not None else 1

Also applies to: 258-260

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@centml/sdk/shell/session.py` around lines 226 - 229, The code currently
initializes exit_code = 0 and silently falls through on ConnectionClosed,
causing a false "success" when END_MARKER wasn't received; update the
ConnectionClosed handling in centml/sdk/shell/session.py to treat a socket close
before detecting END_MARKER as a transport failure by either setting exit_code
to a non-zero value (e.g., 1) or raising an exception (e.g.,
RuntimeError("socket closed before END_MARKER")), and ensure this logic checks
the buffer/is_done/END_MARKER state so you only mark failure when the trailer
was not parsed; apply the same change to the second occurrence referenced
(around lines 258-260) so both code paths return non-zero or raise on missing
END_MARKER.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
centml/cli/shell.py (2)

41-60: ⚠️ Potential issue | 🟠 Major

Guard against prompting in non-interactive contexts.

exec_cmd can be invoked non-interactively (CI, scripts), but _connect_args may call _select_pod which uses click.prompt(). This will hang or fail when multiple pods exist and --pod/--first-pod is not specified. Consider requiring --pod or --first-pod when stdin is not a TTY, or defaulting to first pod in non-interactive mode.

🛠️ Proposed fix
 def _connect_args(deployment_id, pod, shell_type, first_pod=False):
     """Resolve pod, build WebSocket URL, and obtain auth token."""
     with get_centml_client() as cclient:
         running_pods = get_running_pods(cclient, deployment_id)
         if not running_pods:
             raise click.ClickException(f"No running pods found for deployment {deployment_id}")
 
         if pod is not None:
             try:
                 pod_name = _resolve_pod(running_pods, pod)
             except ShellError as exc:
                 raise click.ClickException(str(exc)) from exc
         elif len(running_pods) == 1 or first_pod:
             pod_name = running_pods[0]
+        elif not sys.stdin.isatty():
+            raise click.ClickException(
+                "Multiple pods found. Specify --pod or --first-pod for non-interactive use."
+            )
         else:
             pod_name = _select_pod(running_pods, deployment_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@centml/cli/shell.py` around lines 41 - 60, The _connect_args function can
block by calling _select_pod (which uses click.prompt) in non-interactive runs;
fix by detecting interactivity (e.g. using sys.stdin.isatty()) and if not
interactive and pod is None and not first_pod and len(running_pods) > 1, do not
call _select_pod: instead raise a click.ClickException telling the user to
provide --pod or --first-pod (or alternatively choose to default to
running_pods[0] if you prefer non-interactive default behavior); implement this
check inside _connect_args before calling _select_pod so build_ws_url and
auth.get_centml_token remain unchanged.

80-92: ⚠️ Potential issue | 🟠 Major

Use shlex.join() to preserve command argument boundaries.

" ".join(command) loses quoting and escaping. Arguments like "hello world", $HOME, or * will be misinterpreted by the pod shell. Use shlex.join(command) for correct POSIX shell escaping.

🛠️ Proposed fix
+import shlex
 ...
 def exec_cmd(deployment_id, command, pod, shell_type, first_pod):
     ws_url, token = _connect_args(deployment_id, pod, shell_type, first_pod)
-    exit_code = asyncio.run(exec_session(ws_url, token, " ".join(command)))
+    exit_code = asyncio.run(exec_session(ws_url, token, shlex.join(command)))
     sys.exit(exit_code)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@centml/cli/shell.py` around lines 80 - 92, The pod exec command concatenates
arguments with " ".join(command) which loses quoting/escaping; change exec_cmd
to construct the shell string with shlex.join(command) before calling
exec_session so argument boundaries are preserved (update the call in exec_cmd
that currently does asyncio.run(exec_session(ws_url, token, " ".join(command)))
to use shlex.join(command)); ensure you import shlex at top of
centml/cli/shell.py if not already present.
🧹 Nitpick comments (4)
centml/sdk/shell/__init__.py (1)

4-12: Consider sorting __all__ for consistency.

Ruff (RUF022) flags that __all__ is not sorted. Sorting improves maintainability and aligns with isort-style conventions.

♻️ Proposed fix
 __all__ = [
-    "ShellError",
     "NoPodAvailableError",
      "PodNotFoundError",
+    "ShellError",
     "build_ws_url",
+    "exec_session",
     "get_running_pods",
     "interactive_session",
-    "exec_session",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@centml/sdk/shell/__init__.py` around lines 4 - 12, The __all__ list is
unsorted (RUF022); sort the exported names alphabetically to satisfy Ruff and
maintain consistency — reorder the entries in __all__ so identifiers like
"NoPodAvailableError", "PodNotFoundError", "ShellError", "build_ws_url",
"exec_session", "get_running_pods", "interactive_session" appear in ascending
lexicographic order (update the __all__ variable in centml/sdk/shell/__init__.py
where it's defined).
tests/test_sdk_shell_session.py (1)

315-323: Store the task reference to prevent potential garbage collection.

The asyncio.create_task() return value is not stored, which could theoretically allow the task to be garbage collected before completion. While this test likely works in practice, storing the reference is the recommended pattern.

♻️ Suggested fix
                 async def _run():
                     async def _set_shutdown():
                         await asyncio.sleep(0.1)
                         shutdown.set()

-                    asyncio.create_task(_set_shutdown())
+                    _ = asyncio.create_task(_set_shutdown())
                     return await forward_io(ws, [80, 24], shutdown)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_sdk_shell_session.py` around lines 315 - 323, The create_task call
in the local coroutine _run uses asyncio.create_task(_set_shutdown()) without
storing its return value, which risks the task being garbage-collected; assign
the created Task to a local variable (e.g., shutdown_task or setter_task) inside
_run so the reference lives at least until forward_io(ws, [80, 24], shutdown)
completes (optionally await or let it finish naturally), keeping function names
_run and _set_shutdown unchanged.
centml/sdk/shell/session.py (2)

127-136: Consider narrowing the exception catch during task cleanup.

The except (asyncio.CancelledError, Exception) catch is broad and silently suppresses all exceptions from cancelled tasks. While this is common in cleanup code, it could mask unexpected errors. Consider catching specific expected exceptions or at minimum logging unexpected ones.

♻️ Suggested improvement
     for t in pending:
         try:
             await t
-        except (asyncio.CancelledError, Exception):
-            pass
+        except asyncio.CancelledError:
+            pass  # Expected for cancelled tasks
+        except websockets.ConnectionClosed:
+            pass  # Expected during shutdown
+        except Exception:
+            pass  # Suppress other errors during cleanup
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@centml/sdk/shell/session.py` around lines 127 - 136, The cleanup loop
currently swallows all exceptions with "except (asyncio.CancelledError,
Exception): pass", which can hide real errors; update the loop that awaits tasks
in "pending" to handle asyncio.CancelledError separately and to catch Exception
as "e" and log it (e.g. using the module/session logger) instead of silencing
it, so reference the variables pending, done, and t and replace the broad except
with a specific asyncio.CancelledError branch and an Exception branch that logs
the unexpected exception details.

164-168: Unhandled exceptions from fire-and-forget coroutine.

asyncio.ensure_future(ws.send(...)) schedules the send but doesn't handle potential exceptions. If the WebSocket is closed when a resize signal arrives, this could produce unhandled exception warnings.

♻️ Suggested fix with exception handling
             def _send_resize():
                 c, r = shutil.get_terminal_size()
                 term_size[0], term_size[1] = c, r
-                asyncio.ensure_future(ws.send(json.dumps({"operation": "resize", "rows": r, "cols": c})))
+                async def _do_resize():
+                    try:
+                        await ws.send(json.dumps({"operation": "resize", "rows": r, "cols": c}))
+                    except websockets.ConnectionClosed:
+                        pass  # Connection already closed, ignore resize
+                asyncio.ensure_future(_do_resize())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@centml/sdk/shell/session.py` around lines 164 - 168, The _send_resize closure
currently calls asyncio.ensure_future(ws.send(...)) which can raise unhandled
exceptions if the websocket is closed; change this to schedule a coroutine that
catches and logs or ignores exceptions from ws.send. Create an async helper
(e.g., async def _safe_send_resize(rows, cols): try: await ws.send(...); except
Exception as e: process or debug-log the error) and call
asyncio.create_task(_safe_send_resize(r, c)) inside _send_resize so resize
failures are handled gracefully; reference _send_resize, ws.send, term_size, and
the existing asyncio scheduling call when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@centml/cli/shell.py`:
- Around line 41-60: The _connect_args function can block by calling _select_pod
(which uses click.prompt) in non-interactive runs; fix by detecting
interactivity (e.g. using sys.stdin.isatty()) and if not interactive and pod is
None and not first_pod and len(running_pods) > 1, do not call _select_pod:
instead raise a click.ClickException telling the user to provide --pod or
--first-pod (or alternatively choose to default to running_pods[0] if you prefer
non-interactive default behavior); implement this check inside _connect_args
before calling _select_pod so build_ws_url and auth.get_centml_token remain
unchanged.
- Around line 80-92: The pod exec command concatenates arguments with "
".join(command) which loses quoting/escaping; change exec_cmd to construct the
shell string with shlex.join(command) before calling exec_session so argument
boundaries are preserved (update the call in exec_cmd that currently does
asyncio.run(exec_session(ws_url, token, " ".join(command))) to use
shlex.join(command)); ensure you import shlex at top of centml/cli/shell.py if
not already present.

---

Nitpick comments:
In `@centml/sdk/shell/__init__.py`:
- Around line 4-12: The __all__ list is unsorted (RUF022); sort the exported
names alphabetically to satisfy Ruff and maintain consistency — reorder the
entries in __all__ so identifiers like "NoPodAvailableError",
"PodNotFoundError", "ShellError", "build_ws_url", "exec_session",
"get_running_pods", "interactive_session" appear in ascending lexicographic
order (update the __all__ variable in centml/sdk/shell/__init__.py where it's
defined).

In `@centml/sdk/shell/session.py`:
- Around line 127-136: The cleanup loop currently swallows all exceptions with
"except (asyncio.CancelledError, Exception): pass", which can hide real errors;
update the loop that awaits tasks in "pending" to handle asyncio.CancelledError
separately and to catch Exception as "e" and log it (e.g. using the
module/session logger) instead of silencing it, so reference the variables
pending, done, and t and replace the broad except with a specific
asyncio.CancelledError branch and an Exception branch that logs the unexpected
exception details.
- Around line 164-168: The _send_resize closure currently calls
asyncio.ensure_future(ws.send(...)) which can raise unhandled exceptions if the
websocket is closed; change this to schedule a coroutine that catches and logs
or ignores exceptions from ws.send. Create an async helper (e.g., async def
_safe_send_resize(rows, cols): try: await ws.send(...); except Exception as e:
process or debug-log the error) and call
asyncio.create_task(_safe_send_resize(r, c)) inside _send_resize so resize
failures are handled gracefully; reference _send_resize, ws.send, term_size, and
the existing asyncio scheduling call when making the change.

In `@tests/test_sdk_shell_session.py`:
- Around line 315-323: The create_task call in the local coroutine _run uses
asyncio.create_task(_set_shutdown()) without storing its return value, which
risks the task being garbage-collected; assign the created Task to a local
variable (e.g., shutdown_task or setter_task) inside _run so the reference lives
at least until forward_io(ws, [80, 24], shutdown) completes (optionally await or
let it finish naturally), keeping function names _run and _set_shutdown
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 43be36f8-9df6-44f3-8f76-1d891c236df8

📥 Commits

Reviewing files that changed from the base of the PR and between 470a8b6 and 3ff11ee.

📒 Files selected for processing (11)
  • centml/cli/main.py
  • centml/cli/shell.py
  • centml/sdk/api.py
  • centml/sdk/shell/__init__.py
  • centml/sdk/shell/exceptions.py
  • centml/sdk/shell/session.py
  • requirements.txt
  • tests/conftest.py
  • tests/test_cli_shell.py
  • tests/test_sdk_shell_renderer.py
  • tests/test_sdk_shell_session.py

V2arK added 7 commits March 31, 2026 12:21
    Add two new commands under `centml cluster`:
    - `shell <id>` -- interactive terminal session (like docker exec -it)
    - `exec <id> -- <command>` -- run a command and return output (like ssh host "cmd")

    Both connect via WebSocket through the Platform API terminal proxy,
    matching the same protocol used by the Web UI TerminalView.
Replaces str.replace("https://", "wss://") with urllib.parse.urlparse
scheme replacement to avoid CodeQL py/incomplete-url-substring-sanitization.
Replace url.startswith() assertions with urllib.parse.urlparse() checks
to satisfy CodeQL py/incomplete-url-substring-sanitization rule. Reformat
both shell.py and test_shell.py with black.
V2arK and others added 26 commits March 31, 2026 12:21
Add pyte as local terminal emulator (equivalent to xterm.js) to solve
cursor positioning and line wrapping issues. Feed WebSocket output through
pyte Screen/Stream and render only dirty lines with ANSI escape sequences.
shutil.get_terminal_size() returns (columns, lines), not (rows, cols).
The swapped unpacking caused pyte Screen to be created with terminal
line count as width, making the display extremely narrow.
…andling

Replace regex-based _strip_ansi with pyte single-row screen for marker
detection. pyte interprets all VT100/VT220 sequences including OSC and
truecolor escapes that the regex could miss.
If two Code signals arrive within 3 seconds, the shell has exited and
the reconnect just opened a new session. Exit cleanly instead of
looping forever.
Logs to /tmp/centml_shell_debug.log (overridable via
CENTML_SHELL_DEBUG_LOG env var). Traces every WS message, stdin event,
task lifecycle, reconnect decision, and connection close.
The platform API proxy never forwards ArgoCD Code messages and does
not close the WebSocket when the remote shell exits.  Replace the
Code/reconnect logic with exit echo detection: when the server echoes
back "exit\r\n", arm a 2-second idle timeout on ws.recv().  If no
more data arrives, the shell has exited -- break out cleanly.

Also removes Code handling from _exec_session (markers already work).
…ompt

When "exit\r\n" appears at the end of ws data (nothing after it), the
shell has exited -- return immediately instead of waiting 2s.  When
"exit\r\n" is followed by a new prompt (e.g. from echo exit), ignore
it and continue the session.
@michaelshin michaelshin force-pushed the honglin/shell-forwarding-clean branch from 3ff11ee to 78cfb45 Compare March 31, 2026 16:22
@michaelshin michaelshin requested a review from anandj91 March 31, 2026 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants